Boosting¶
"Try again. Fail again. Fail better." (Samuel Beckett)
1. I Contain Multitudes
1.1 Decision Trees
1.2 A Song of Bias and Variance
1.3 Bagging vs Boosting
2. The Strength of Weak Learners
2.1 A Short Trip to Classtown
2.2 Does It Actually Work?
3.1 AdaBoost: Decision Stumps and Decision Boundaries
3.2 Oops, I did it again: AdaBoost Intuition
3.3 AdaBoost: the Algorithm
4.1 All I Want for Boosting is Gradient
4.2 Gradient Boosting Intuition
1. I Contain Multitudes¶
1.1 Decision Trees¶
Define an impurity measure, e.g. the Gini impurity index: $G(N) = \sum_{k}p_k(1-p_k)$, where $p_k$ is the fraction of examples of class $k$ at node $N$.
Algorithm time complexity: O(?)
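As a quick sketch, the Gini impurity of a node can be computed directly from its label counts (a minimal NumPy version; the function name is mine, not from any library):

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity G(N) = sum_k p_k * (1 - p_k) over the node's labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()          # class proportions p_k at the node
    return float(np.sum(p * (1 - p)))

# A pure node has impurity 0; a 50/50 binary node has impurity 0.5.
print(gini_impurity([1, 1, 1, 1]), gini_impurity([0, 1, 0, 1]))
```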
1.2 A song of Bias and Variance¶
- $Y=f(\mathbf{X}) + \epsilon$: true relationship between Y observations and X covariates.
- $\epsilon$: Gaussian noise with zero mean and standard deviation $\sigma_\epsilon$.
- $\hat{f}(\mathbf{x})$: model. $$ Err(x)=E[(Y - \hat{f}(x))^2] $$
$$ Err(x)=(E[\hat{f}(x)] - f(x))^2 + E[(\hat{f}(x) - E[\hat{f}(x)])^2] + {\sigma_{\epsilon}}^2 $$
$$ Err(x)=Bias^2 + Variance + \text{irreducible error} $$
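The decomposition can be estimated empirically by refitting a model on many noisy samples of the same true function. A minimal simulation sketch, where the true function $\sin$, the noise level, the evaluation point, and the polynomial degrees are all illustrative choices of mine:

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                        # illustrative "true" relationship f(X)
sigma_eps = 0.3                   # noise standard deviation sigma_epsilon
x0 = 1.0                          # point at which we decompose the error
xs = np.linspace(0, 2 * np.pi, 30)

def fit_predict(degree):
    """Fit a polynomial to one freshly sampled noisy dataset, predict at x0."""
    ys = f(xs) + rng.normal(0, sigma_eps, xs.shape)
    return np.polyval(np.polyfit(xs, ys, degree), x0)

for degree in (1, 5):
    preds = np.array([fit_predict(degree) for _ in range(500)])
    bias2 = (preds.mean() - f(x0)) ** 2      # squared bias of the model
    variance = preds.var()                   # variance across resampled fits
    print(f"degree {degree}: bias^2 ~ {bias2:.4f}, variance ~ {variance:.4f}")
```

The rigid degree-1 model should show higher bias, while the flexible degree-5 model should show higher variance: the classic trade-off.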
1.3 Bagging vs Boosting¶
2. The strength of weak learners¶
Hypothesis Boosting Problem (Michael Kearns, 1988): can a set of weak learners be combined into a single strong learner?
2.1 A short Trip to Classtown¶
- Destination: $\{(x_1, 1), (x_2, 1), (x_3, -1)\}$
- labels form a point in $\mathbb{R}^3$
- movement hypothesis $h$: $(h(x_1), h(x_2), h(x_3))$
- total movement: $H(x)=\sum_{t}\alpha_t h_t(x)$
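The picture above can be made concrete with a toy example (the three hypotheses below are invented for illustration): each weak hypothesis is just its vector of predictions on the three points, and a weighted sum of individually imperfect hypotheses can still land on the correct side of every coordinate.

```python
import numpy as np

# Labels of the three training points: the "destination" (1, 1, -1) in R^3.
y = np.array([1, 1, -1])

# Three weak hypotheses, each written as its prediction vector
# (h(x1), h(x2), h(x3)); each one is wrong on exactly one point.
h1 = np.array([1, 1, 1])    # wrong on x3
h2 = np.array([-1, 1, -1])  # wrong on x1
h3 = np.array([1, -1, -1])  # wrong on x2

# Total movement H = sum_t alpha_t * h_t, here with alpha_t = 1 for all t.
H = h1 + h2 + h3
print(np.sign(H))  # the sign of H recovers every label correctly
```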
2.2 Does it actually work?¶
3.1 AdaBoost: Decision Stumps and Decision Boundaries¶
Yes, a weak learner can be boosted into a strong one (Schapire, 1990).
Decision stump: a decision tree with only one split.
Decision boundary: line separating classes in a classification problem.
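A decision stump is simple enough to fit by brute force: try every feature, every threshold, and both polarities, and keep the split with the lowest weighted error. A minimal sketch (the function names are mine, not from any library):

```python
import numpy as np

def fit_stump(X, y, weights):
    """Exhaustive search for the one-split stump minimizing weighted error.
    Returns ((feature, threshold, polarity), error); the stump predicts
    `polarity` where X[:, feature] > threshold and `-polarity` elsewhere."""
    best, best_err = (0, 0.0, 1), np.inf
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            for pol in (1, -1):
                pred = np.where(X[:, f] > thr, pol, -pol)
                err = weights[pred != y].sum()   # weighted misclassification
                if err < best_err:
                    best_err, best = err, (f, thr, pol)
    return best, best_err

def stump_predict(stump, X):
    f, thr, pol = stump
    return np.where(X[:, f] > thr, pol, -pol)
```

On one-dimensional data the resulting decision boundary is a single threshold; in higher dimensions it is an axis-aligned hyperplane.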
3.2 Oops, I did it again: AdaBoost Intuition¶
- Train a decision stump on the weighted training examples, so that examples with higher weights are prioritized.
- Update the weights of the training examples such that misclassified examples are assigned higher weights; the worse the error, the higher the weight.
3.3 AdaBoost: the Algorithm¶
- Train a weak learner $h_t(x)$ on the weighted training examples $(x_i, y_i, D_i)$.
- Compute the weighted training error of $h_t(x)$: $\epsilon_t=\sum_{i:\,y_i\neq h_t(x_i)}D_i$.
- Compute the weight $\alpha_t$ of the weak learner from $\epsilon_t$: $\alpha_t=\frac{1}{2}\ln\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$.
- Update the weights of the training examples: $D_i \leftarrow \frac{D_i\, e^{-\alpha_t y_i h_t(x_i)}}{Z_t}$, where $Z_t$ is a normalization constant so the weights sum to one.
How does the weight change for correctly classified examples? How for misclassified ones?
The overall classifier after $T$ iterations is just a weighted ensemble:
$$ H(x)= \sum_{t=1}^T \alpha_t h_t (x). $$
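Putting the steps together, here is a self-contained AdaBoost sketch with the brute-force stump search inlined as the weak learner (labels assumed in $\{-1,+1\}$; all helper names are mine):

```python
import numpy as np

def fit_stump(X, y, D):
    """Weak learner: best one-split stump under example weights D."""
    best, best_err = (0, 0.0, 1), np.inf
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            for pol in (1, -1):
                pred = np.where(X[:, f] > thr, pol, -pol)
                err = D[pred != y].sum()
                if err < best_err:
                    best_err, best = err, (f, thr, pol)
    return best, best_err

def stump_predict(stump, X):
    f, thr, pol = stump
    return np.where(X[:, f] > thr, pol, -pol)

def adaboost_fit(X, y, T=10):
    """AdaBoost with decision stumps; labels y must be in {-1, +1}."""
    n = len(y)
    D = np.full(n, 1 / n)                      # start with uniform weights
    stumps, alphas = [], []
    for _ in range(T):
        stump, eps = fit_stump(X, y, D)        # train; eps = weighted error
        if eps >= 0.5:                         # no better than chance: stop
            break
        eps = max(eps, 1e-10)                  # guard against division by zero
        alpha = 0.5 * np.log((1 - eps) / eps)  # weight of the weak learner
        pred = stump_predict(stump, X)
        D = D * np.exp(-alpha * y * pred)      # up-weight mistakes
        D /= D.sum()                           # renormalize (Z_t)
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Weighted ensemble H(x) = sum_t alpha_t h_t(x), thresholded at zero."""
    H = sum(a * stump_predict(s, X) for s, a in zip(stumps, alphas))
    return np.sign(H)
```

Note how the weight update implements the answer to the question above: $e^{-\alpha_t y_i h_t(x_i)}$ shrinks the weight when $y_i$ and $h_t(x_i)$ agree and grows it when they disagree.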
4.1 All I want for Boosting is Gradient¶
We use the Branin function as a test function to visualize gradient descent. It is a function of two variables $w_1$ and $w_2$:
$$ f(w_1, w_2) = a (w_2 - b w_1^2 + c w_1 - r)^2 + s (1-t) \cos{w_1} + s $$
where $a = 1$, $b = 5.1/(4\pi^2)$, $c = 5/\pi$, $r = 6$, $s = 10$, and $t = 1/(8\pi)$ are fixed constants.
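A minimal gradient-descent run on the Branin function, with the gradient worked out analytically from the formula above (the starting point and learning rate are arbitrary choices of mine):

```python
import numpy as np

# Branin constants: a = 1, b = 5.1/(4*pi^2), c = 5/pi, r = 6, s = 10, t = 1/(8*pi)
A, B, C = 1.0, 5.1 / (4 * np.pi**2), 5 / np.pi
R, S, T = 6.0, 10.0, 1 / (8 * np.pi)

def branin(w):
    w1, w2 = w
    return A * (w2 - B * w1**2 + C * w1 - R)**2 + S * (1 - T) * np.cos(w1) + S

def branin_grad(w):
    """Analytic gradient of the Branin function."""
    w1, w2 = w
    inner = w2 - B * w1**2 + C * w1 - R
    dw1 = 2 * A * inner * (C - 2 * B * w1) - S * (1 - T) * np.sin(w1)
    dw2 = 2 * A * inner
    return np.array([dw1, dw2])

# Plain gradient descent from an arbitrary starting point.
w = np.array([6.0, 14.0])
for _ in range(5000):
    w -= 0.01 * branin_grad(w)
print(w, branin(w))  # should approach one of Branin's global minima (f ~ 0.398)
```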
4.2 Gradient Boosting Intuition¶
- Loss function $L$
- Model at step $t$: $H_t(x)=H_{t-1}(x) + \alpha_t h_t (x)$
- Choose $h_t(x)$ to be the closest to the direction of the negative gradient of $L$: $-\frac{\partial L}{\partial H}\big|_{H=H_{t-1}(x)}$
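For the squared loss $L = \frac{1}{2}(y - H)^2$, the negative gradient $-\frac{\partial L}{\partial H}$ is simply the residual $y - H$, so each round fits a weak learner to the current residuals. A self-contained sketch using one-split regression stumps as the weak learners (all helper names are mine):

```python
import numpy as np

def fit_regression_stump(X, y):
    """Weak learner: one-split regression tree predicting the leaf means."""
    best, best_sse = None, np.inf
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f])[:-1]:        # keep both sides nonempty
            left = X[:, f] <= thr
            ml, mr = y[left].mean(), y[~left].mean()
            sse = ((y[left] - ml)**2).sum() + ((y[~left] - mr)**2).sum()
            if sse < best_sse:
                best_sse, best = sse, (f, thr, ml, mr)
    return best

def reg_stump_predict(stump, X):
    f, thr, ml, mr = stump
    return np.where(X[:, f] <= thr, ml, mr)

def gradient_boost_fit(X, y, T=100, lr=0.1):
    """Gradient boosting for squared loss L = (y - H)^2 / 2:
    the negative gradient -dL/dH is just the residual y - H."""
    H = np.full(len(y), y.mean())            # constant initial model H_0
    stumps = []
    for _ in range(T):
        residual = y - H                     # -dL/dH evaluated at H_{t-1}
        h = fit_regression_stump(X, residual)
        H = H + lr * reg_stump_predict(h, X) # H_t = H_{t-1} + alpha_t h_t
        stumps.append(h)
    return y.mean(), stumps

def gradient_boost_predict(y0, stumps, X, lr=0.1):
    return y0 + lr * sum(reg_stump_predict(h, X) for h in stumps)
```

Here the step size $\alpha_t$ is a fixed learning rate rather than a line search: a common simplification in practice.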